- October 25, 1988
-
-
- New Files in this group:
-
- ST.EXE v1.05 December 29, 1988 (minor fixes)
- EXP.EXE v1.01 December 28, 1988 (minor fixes)
-
-
- Introduction (and a little rattling of the cup...)
- --------------------------------------------------
- ST/EXP represents a new level of performance for text data
- compression, typically compressing text files to about 30% of their
- original size. However, high throughput is not sacrificed in the
- process; it is competitive in speed with the best programs based upon
- the vastly simpler Lempel-Ziv algorithm.
-
- I believe it recognizes a point seemingly passed over in all the
- other attempts at text data compression that I know of. Namely, that
- the pursuit of ever higher storage capacities without, at the same time,
- addressing more convenient information ACCESS, is ultimately a pursuit
- that will end in chaos. The word-orientation of ST/EXP's compression
- technique provides the "hooks" to address exactly this issue of
- convenient information access.
-
- If you find these programs useful, a contribution of $25 will be
- greatly appreciated. For $50, you will be registered and will receive
- the next program update, with manual, at no additional cost. Quantity
- discounts and site licensing can be worked out for anyone who gets
- really serious about it.
-
- ST/EXP was definitely NOT a hobbyist effort. I'm trying to do this
- for a living. With your support, Bartles and James has been able to
- bring you an ever-expanding array of different-flavored wine coolers.
- Also with your support, I will be able to continue to develop this basic
- technology into something that will be even more useful and interesting,
- which will ultimately save YOU money! You can send your INVESTMENT to:
-
- MicroComputer Square
- 126 Hancock Avenue
- Spartanburg, SC 29302
-
- and THANKS!!!
-
-
- Now the Manual
-
- The following programs and files are included:
-
- ST.EXE -- the FlySpeed compression program
- EXP.EXE -- the FlySpeed expansion program
- FLYSPEED.MC2 -- FlySpeed's dictionary data file
-
- If you simply type the name of either program, followed by Return,
- it will tell you that you need to supply a file-specification.
-
-
- ST/EXP
-
- ST/EXP (es-tee-e-x-p) is really a pair of programs: ST.EXE, which
- compresses text files (we had to call it something, so we called it
- "stomp", or "st" for short), and EXP.EXE (short for "expand"), which
- re-expands the compressed file to an identical copy of the original. By
- "text", we mean the normal output of word-processing programs,
- preferably saved in a standard ASCII format (as is usually done for
- telecommunications use). Eventually, the program should be able to
- reliably compress database files as well (and maybe spreadsheet files a
- little later), but for now we intend to concern ourselves with
- compression of written language, which is actually a more difficult
- problem that really has not been adequately addressed, until now.
-
-
- How It Works
-
- Well, we can't entirely say, since some aspects of the program's
- operation constitute patentable processes, and the patent application
- hasn't been filed yet. Beyond that, the program simply doesn't work by
- any one simple "trick" or "scheme"; it is actually a collection of
- algorithms, some novel and wretchedly complex, with an efficient means
- of determining the best algorithm to use for the data immediately being
- examined. For example, the well-known (in computing circles)
- run-length encoding is used to optimally represent
- situations in which a particular character is repeated a number of
- times, such as a long sequence of spaces. We expect that the program
- will eventually be able to compress database files quite well, since
- spaces or "null characters" are often employed as fill characters for
- "empty fields" in database files, and these sequences can be compressed
- to just about nothing using a run-length representation. Compression of
- true text data, which is to say written language in a digital form, is
- not so easily accomplished, unfortunately.
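The run-length representation mentioned above can be illustrated with a minimal sketch. This is not ST/EXP's actual code or file format, which the author does not disclose; the function names, the (count, byte) pair layout, and the 255 cap on run length are all assumptions made for the example.

```python
def rle_encode(data):
    """Collapse each run of a repeated byte into a (count, byte) pair.

    Runs are capped at 255 so that a count would fit in one byte
    (an assumption of this sketch, not a documented ST/EXP detail).
    """
    runs = []
    for b in data:
        if runs and runs[-1][1] == b and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, b)
        else:
            runs.append((1, b))
    return runs


def rle_decode(runs):
    """Rebuild the original bytes from (count, byte) pairs."""
    return b"".join(bytes([b]) * count for count, b in runs)


# A made-up database-style record: short fields padded with fill characters.
sample = b"NAME" + b" " * 40 + b"CITY" + b"\x00" * 20
runs = rle_encode(sample)
assert rle_decode(runs) == sample
print(len(sample), "bytes become", len(runs), "runs")
```

The 40 spaces and 20 nulls each collapse to a single pair, while the letters of "NAME" and "CITY" gain nothing. That is why fill-heavy database files suit this representation and ordinary written language does not.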
-
- (Those of you who are particularly interested in compression of
- database and spreadsheet files should make your interests known to us.
- We would especially appreciate any actual dBase II and Lotus files you
- care to send us, since lack of actual data to use for tests is one of
- the most serious impediments to the development and refinement of that
- capability.)
-
- We have described Fly Coding as an "optimal representation of
- language", implicitly meaning "an optimal digital representation of
- written language", including punctuation, etc. The use of the word
- "optimal" naturally raises the question, "Optimal in what sense?", or
- "What parameter or parameters are being optimized?"
-
-
- A Real-Quick Tutorial on Digital Data Storage
-
- Computer people like to talk in terms of bits and bytes. A BIT is
- a Binary digIT, which can be either a 0 or a 1, easily represented
- electronically as a low (less than 1 Volt) or high (more than 3 Volts)
- voltage. Since a bit is so little (it takes about 10 to make the
- equivalent of 3 normal, base-10 digits), a byte (which is 8 bits) is
- generally more convenient to talk about.
-
- A byte can represent 2-to-the-power-of-8, or 256, different values,
- and may thus be thought of as a single digit, base-256, or, in
- "hexadecimal" notation, as two digits, base-16.
-
- If, say, the Declaration of Independence, stored as a normal text
- file, contained 10,000 bytes, then, in effect, we are using a 10,000
- digit number, base-256, to represent the information contained in the
- Declaration of Independence (a number vastly larger than the number of
- subatomic particles in the universe). This would be exactly equivalent
- to 20,000 digits, base-16, and approximately 24,000 digits, base-10
- (since "5 digits", 100,000 base-16, is 1,048,576 or about "6 digits",
- base-10).
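The digit counts quoted above can be checked with a few lines of arithmetic. The sketch below is only a verification aid; the helper name `digits_needed` is made up for the example, and the 10,000-byte file size is the hypothetical figure used in the text.

```python
import math


def digits_needed(n_bytes, base):
    """Digits required in `base` to write the largest n_bytes-long number."""
    bits = 8 * n_bytes
    return math.ceil(bits / math.log2(base))


n = 10_000  # hypothetical file size used in the text
print(digits_needed(n, 256))  # 10000
print(digits_needed(n, 16))   # 20000
print(digits_needed(n, 10))   # 24083, i.e. approximately 24,000

# The rules of thumb used in the text:
print(2 ** 10)   # 1024, so about 10 bits per 3 decimal digits
print(16 ** 5)   # 1048576, so 5 hex digits are about 6 decimal digits
```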
-
- What we would like to do is to find a number with significantly
- fewer digits which can represent the same information, thereby reducing
- the space required to store the number, whether we are storing it on a
- piece of paper, or a magnetic or optical disk (and also reducing the
- time it will take to transmit the number by modem). If we can reduce
- the number of digits needed by a factor of 4, then, in effect, we are
- transforming the number into a second number which is approximately the
- fourth root of the first (we must, of course, be able to transform the
- second number identically back into the first).
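The fourth-root remark follows from the way digit count relates to magnitude. In the sketch below, the symbols b (the base), d (the number of digits), N (the original number) and M (the compressed number) are introduced here for illustration and do not appear in the original text:

```latex
N \approx b^{d}, \qquad
M \approx b^{d/4} = \left(b^{d}\right)^{1/4} \approx N^{1/4}
```

For the 10,000-byte example, N is on the order of 256^10000 and a four-fold reduction gives M on the order of 256^2500.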
-
- We would like to say that, in producing an "optimal representation
- of language", we have found a way of finding the minimum number which
- can represent a given document. We think we can make this statement,
- with the qualification that the technique should not resort to
- natural-language comprehension in order to perform the compression.
-
- This qualification hints at the other parameters which are of
- concern: namely, the speed at which the process operates, as well as its
- memory requirements, reliability, and certainly the feasibility of
- accomplishing the desired goal before starvation sets in. In other
- words, what is being optimized is the USEFULNESS of the technique, which
- is a more subtle issue than just optimizing the compression factor.
- Nonetheless, we feel that we were able to avoid all but a few minor
- compromises between high compression factor and high throughput.
-
-
- But How Does It Work?
-
- In order to discuss "how it works" adequately, some discussion of
- "what information is", from an epistemological perspective, is in order.
- But since we've probably already lost some of you, we'd really rather
- not go into that...
-
- However, an explanation that we think will satisfy a lot of people,
- and which really does explain much of how the program works, is simply
- this:
-
- The program uses a "dictionary", contained in the file
- "flyspeed.mc2", to convert words found in a document to numbers, which
- take up less room than the original word, as represented according to
- the standard convention.
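The general idea of replacing words with dictionary numbers can be sketched as follows. This is a toy illustration only: the word list, the function names, and the token layout are hypothetical, and nothing here reflects the actual contents or format of "flyspeed.mc2", which the author does not describe.

```python
import re

# Hypothetical toy dictionary; the real dictionary file is far larger
# and its format is not documented.
WORDS = ["the", "of", "and", "to", "we", "people", "government"]
WORD_TO_ID = {w: i for i, w in enumerate(WORDS)}


def tokenize(text):
    """Replace known words with integer IDs; keep everything else as text."""
    tokens = []
    # Split into alternating runs of letters and non-letters.
    for piece in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        tokens.append(WORD_TO_ID.get(piece, piece))
    return tokens


def detokenize(tokens):
    """Turn IDs back into words and re-join the pieces."""
    return "".join(WORDS[t] if isinstance(t, int) else t for t in tokens)


text = "we the people of the government"
tokens = tokenize(text)
print(tokens)
assert detokenize(tokens) == text
```

The saving comes from the size of the numbers: an index into a dictionary of up to 65,536 words fits in two bytes, whereas "government" occupies ten bytes in ASCII. Words missing from the dictionary, and everything between words, pass through unchanged in this sketch.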
-
- The dictionary, itself, employs data-compression techniques, as do
- the dictionaries used in spelling checkers that come with most
- word-processing packages. The difference is that, while the
- data-compression techniques employed in common spelling checkers are
- oriented toward anachronistic concerns over saving memory, the
- data-compression techniques employed in FlySpeed's dictionary are geared
- toward high-speed lookup. It is for this reason that FlySpeed's
- "tokenizing engine" runs from 19 (WordPerfect) to 115 (Borland's Turbo
- Lightning) times faster than typical spelling checkers on the market.
-
- A writer of an ad which appeared recently in a major
- personal-computing publication, for a board which compresses database
- files, had "cutesie" compression programs mashing, mangling, and
- otherwise doing terrible things to data. (We're not sure what's so
- different about having the microprocessor on your computer's
- motherboard do the compression, as opposed to having a microprocessor
- on a plug-in board do it, if you have any plug-in slots left.) The ad
- pointed up rather well the fact that most people haven't got a clue
- as to what data-compression really means.
-
- In a nutshell, data-compression simply means finding a more
- efficient representation of data than the representation commonly in
- use, and being able to translate back and forth between the two. There
- is usually nothing which intrinsically recommends "the representation
- commonly in use", other than simplicity and expediency. We feel that,
- given the power and sophistication of modern microcomputers,
- simplicity and expediency no longer constitute sufficient justification
- for the use of ASCII (American Standard Code for Information
- Interchange), at least not for the purposes of information storage and
- communication. Furthermore, as businesses are beginning to build up
- megabytes of records and documents on their computers, there is a need
- not only to address efficient information storage and communication, but
- efficient information retrieval, as well.
-
-
- Running ST/EXP
-
- ST and EXP must be run from the directory in which they are
- located (you can't simply put them in some directory named in a "path
- command", and then run them from whatever directory you happen to be
- in).
-
- We could eliminate this restriction, and we'll do so if people
- scream for it, but the programs load somewhat faster when DOS doesn't
- have to look very far for them. For short documents, the time it takes
- DOS simply to load ST/EXP can be a significant portion of the program's
- "run time".
-
- Besides, you'll compress files after you've finished creating them,
- at which time you'll probably be in your word-processing directory, so
- your word-processing directory is the logical place to put ST/EXP.
-
- In order to compress a file, simply type:
-
- ST filename
-
- More generally, you can type:
-
- ST filespec1 ... filespecN
-
- in order to compress all files having those file-specifications. For
- example, you would type:
-
- ST \mary\*.DOC john\*.TXT gossip?.*
-
- in order to compress all files in the directory "\mary" having the
- extension "DOC"; all files in the subdirectory "john" of the current
- directory having the extension "TXT"; and all files in the current
- directory, with any extension, whose names consist of "gossip" followed
- by at most one more character.
-
- Expansion works in a similar fashion; however, exp always looks
- for files having a "prs" extension (which is the extension assigned by
- st to its compressed output file). Thus, in order to expand the files
- above, you would type:
-
- exp \mary\* john\* gossip?
-
-
- A Nasty Footnote about Microsoft Word
-
- We hope this isn't going to get to be a regular thing with
- Microsoft Word, but we need to add a footnote...
-
- If you get out your calculator and divide the file-size of your
- Word documents by 512, you'll always end up with an integer. This is
- not due to some great coincidence of the Cosmos. Since the sector size
- used by IBM PCs is 512 bytes, Microsoft may have decided that it
- doesn't "hurt anything" to pad each file out to a whole number of
- sectors. Even if you plan to transmit the document
- by modem, with a 1200 baud modem, it would take only about an extra 4
- seconds to transmit the maximum 511 bytes of unnecessary junk left at
- the end of the file.
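The padding and transmission figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: the function names are made up, "1200 baud" is treated as 1200 bits per second, and each byte is assumed to cost 10 bits on the wire (8 data bits plus a start and a stop bit).

```python
SECTOR = 512        # bytes per sector, as stated in the text
BITS_PER_BYTE = 10  # assumption: 8 data bits plus start and stop bits


def padding_bytes(true_size):
    """Filler bytes needed to round true_size up to a whole sector."""
    return (-true_size) % SECTOR


def transmit_seconds(n_bytes, bits_per_second=1200):
    """Time to send n_bytes over a serial link at the given rate."""
    return n_bytes * BITS_PER_BYTE / bits_per_second


print(padding_bytes(6145))              # 511, the worst case
print(padding_bytes(1024))              # 0, already a whole number of sectors
print(round(transmit_seconds(511), 2))  # 4.26 seconds
```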
-
- However, since ST/EXP is not at liberty to decide what in a file is
- and is not important, when it encounters this junk, it must faithfully
- deal with it. This can have a disproportionate impact on the
- compression one may achieve with Microsoft Word files, particularly for
- files with perhaps only 1000 words, and 511 bytes of junk at the end.
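To see why the junk matters more for short files, consider a rough worked example. Every number in it is an assumption: the text is taken to compress to 30% (the typical figure quoted in the introduction), the junk is taken to be incompressible (a worst case), and a 1000-word file is taken to be about 6,000 bytes.

```python
def overall_ratio(text_bytes, junk_bytes, text_ratio=0.30, junk_ratio=1.0):
    """Compressed size divided by original size for text plus trailing junk.

    text_ratio and junk_ratio are assumed figures, not measurements.
    """
    compressed = text_bytes * text_ratio + junk_bytes * junk_ratio
    return compressed / (text_bytes + junk_bytes)


short_file = overall_ratio(6_000, 511)   # about 1000 words (assumed size)
long_file = overall_ratio(60_000, 511)   # about 10,000 words (assumed size)

print(f"short file: {short_file:.1%}")   # about 35.5%
print(f"long file:  {long_file:.1%}")    # about 30.6%
```

Under these assumptions the same 511 bytes cost the short file roughly five percentage points and the long file well under one.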
-
- One solution is to store your Word documents on disk in the
- standard ASCII format, by setting the printer to "plain", and printing
- to a file as described in the Word manual. You may need to do this
- anyway, if you are sending electronic mail to someone who doesn't have
- Microsoft Word. However, Word is not especially fast at saving
- documents in text form.
-
- The best solution, which is out of your hands as well as ours, is
- for Microsoft to set the file-size to its real value, or, at least, set
- all the junk to spaces or nulls.
-
- Where all this junk comes from, this writer isn't entirely certain.
- I once noticed that one Word file I'd just written had junk from a
- previous file I'd been working on, which suggests that the junk is
- simply whatever happened to be in some particular section of memory when
- the file was saved.
-
- So, if you're using Microsoft Word to write some electronic mail
- to someone you and a co-conspirator are planning to murder, and you're
- also writing to your co-conspirator, write your letter to your
- co-conspirator AFTER you write the letter to your intended victim.
-
- Conversely, if you suspect that someone is conspiring to murder
- you, and you get some electronic mail from them written with Word, type
- the file to the screen to display any junk which might be at the end.
- It may just contain a fragment of some mail written previously to their
- co-conspirator, which may give their plan away, and provide evidence for
- the authorities.
-
- In addition to this junk at the end of a Word file, there is also
- a 128-byte header at the beginning of Word files, which appears to
- contain useful information (at least useful to Word itself), such as
- what the right-hand margin setting for the file is. Of course, having
- to deal with this non-language information also can hurt the compression
- attainable with Microsoft Word, though it doesn't cause that much of a
- problem for longer files.
-
-
- ----------------end-of-author's-documentation---------------
-
- Software Library Information:
-
- This disk copy provided as a service of
-
- The Public (Software) Library
-
- We are not the authors of this program, nor are we associated
- with the author in any way other than as a distributor of the
- program in accordance with the author's terms of distribution.
-
- Please direct shareware payments and specific questions about
- this program to the author of the program, whose name appears
- elsewhere in this documentation. If you have trouble getting
- in touch with the author, we will do whatever we can to help
- you with your questions. All programs have been tested and do
- run. To report problems, please use the form that is in the
- file PROBLEM.DOC on many of our disks or in other written format
- with screen printouts, if possible. The P(s)L cannot debug
- programs over the telephone.
-
- Disks in the P(s)L are updated monthly, so if you did not get
- this disk directly from the P(s)L, you should be aware that
- the files in this set may no longer be the current versions.
-
- For a copy of the latest monthly software library newsletter
- and a list of the 1,800+ disks in the library, call or write
-
- The Public (Software) Library
- P.O.Box 35705 - F
- Houston, TX 77235-5705
- (713) 665-7017
-